# Loading packages
library(tidyverse)
library(knitr)
library(DESeq2)
library(magrittr)
library(pheatmap)
library(EnhancedVolcano)
library(org.Hs.eg.db)
library(AnnotationDbi)
library(clusterProfiler)
library(treemap)
Systemic lupus erythematosus (SLE) is a complex and heterogeneous autoimmune disease characterized by widespread inflammation, autoantibody production, and multi-organ involvement. A key molecular feature of SLE is the overactivation of the type I interferon (IFN-I) pathway, which contributes to the dysregulation of both innate and adaptive immune responses. Understanding how this transcriptional dysregulation manifests in immune cell subsets is essential for identifying disease biomarkers and therapeutic targets.
In this study, I explored transcriptomic differences in CD4+ T cells isolated from patients with SLE and healthy individuals. Our analysis focused on identifying differential gene expression (DGE) and enriched biological processes, with the goal of better understanding the immune landscape and molecular mechanisms underlying SLE.
We analyzed data from the GEO dataset GSE97263, which is also publicly available through the European Nucleotide Archive (ENA Project: PRJNA381147). This dataset originates from a study titled “Type I interferons affect the metabolic fitness of CD8+ T cells from patients with systemic lupus erythematosus”, published in Nature Communications in 2021 by Shen et al. The research was conducted by Marina Botto and Chia-Lin Shen at Imperial College London, who generated and deposited the raw sequencing data. Although the original study focused on the impact of IFN-I on the metabolic profile of CD8+ T cells, the dataset also includes paired data from CD4+ T cells, which are a critical component of adaptive immunity and highly relevant to SLE pathogenesis.
CD4+ T cells were purified from buffy coat cones obtained from healthy donors and patients diagnosed with SLE. Cell sorting was performed using the human CD8+ and CD4+ T Cell Isolation Kits (Miltenyi Biotec). For RNA extraction, the authors used Qiagen’s RNeasy Mini Kit, incorporating an on-column DNase treatment to remove genomic DNA. RNA integrity was confirmed via Agilent Bioanalyzer with RIN scores ≥ 9.0 for all samples, ensuring high-quality inputs for sequencing.
The RNA-seq libraries were prepared using the TruSeq Stranded mRNA HT kit (Illumina) with poly-A selection to isolate mRNA. The protocol involved RNA purification and fragmentation using poly-T oligo-attached magnetic beads, followed by first and second-strand cDNA synthesis. The cDNA 3′ ends were adenylated, adapters were ligated, and libraries were amplified through 11 cycles of PCR. Libraries were size-selected using AMPure XP Beads and verified for quality before sequencing. The libraries were randomized across lanes to minimize batch effects and sequenced on the Illumina HiSeq 2500 platform using 100 bp paired-end reads, spanning 12.5 total lanes.
The final dataset includes:
• 14 healthy controls • 14 patients with active SLE • 16 patients with less active SLE
For the purposes of this project, I focused on comparing gene expression in CD4+ T cells between patients with active SLE and healthy controls.
Our central hypothesis is that CD4+ T cells from patients with active SLE exhibit significant transcriptomic differences, particularly involving interferon response pathways, viral mimicry programs, and immune cell activation states, when compared to healthy controls. I anticipate that type I interferon-stimulated genes and immune activation markers will be upregulated in SLE, while other homeostatic pathways may be downregulated. Through differential gene expression analysis and functional enrichment, I aim to uncover both known and novel gene programs associated with SLE pathology.
I begin by accessing cayuga and going to my scratch directory. This is where I will conduct and store all my work.
ssh cayuga
angsd
I then create a new directory for the project using the
makedir prompt, and cd into it. I make a new
folder for the data using the same prompt and cd into
it.
mkdir project
cd project
mkdir data
cd data
I then navigate to the European Nucleotide Archive (ENA) website, and use the search function to find the correct page correlating to the data (PRJNA381147). Simultaneously, I have the SRA Run Selector open, and I use it to match the Run Accessions for the active SLE subjects and the control subjects on ENA. I select the accessions corresponding to forward and reverse samples of active SLE patients and controls on ENA (excluding less active SLE samples), and I press on the button to ‘Get download script’.
The downloaded script is comprised of lines using the
wget command to download the relevant forward and reverse
fastq files Using this script, I make the the first slurm script and
save as 01_sra_download.sh.
I then run chmod and sbatch to submit the
task:
chmod +x 01_sra_download.sh
sbatch 01_sra_download.sh
This downloads all the FastQ files I need for my project. I then
write another script 02_rename. I similarly run
chmod and then run the script.
chmod +x 02_rename.sh
./02_rename.sh
I wanted to run FastQC on one sample to get an initial understanding
on the quality of the sample and to understand whether trimming is
necessary. I therefore activate the angsd conda environment
and run fastqc per below:
conda activate angsd
fastqc Ctrl_1_f.fastq.gz
Below are some of the resulting QC graphs, and relevant interpretations.
include_graphics("graphics/per_sequence_quality.png")
include_graphics("graphics/per_sequence_GC_content.png")
include_graphics("graphics/per_base_sequence_content.png")
The “Per sequence quality scores” plot shows excellent overall read quality. The distribution of mean Phred scores per read is tightly centered around 37–38, indicating that the vast majority of reads have very high base call accuracy (Phred > 30 corresponds to 99.9% accuracy). There are virtually no reads with low average quality, and the green checkmark confirms that this metric passes quality thresholds. This suggests the sequencing run was highly successful.
The “Per sequence GC content” plot displays a nearly perfect normal distribution centered around ~50% GC content, which is consistent with the expected GC content for human RNA. The observed GC distribution (purple) closely matches the theoretical distribution (blue), suggesting that there is no GC bias or contamination in the sample. The green checkmark further confirms that this metric passes quality control. This indicates a well-prepared library and high-quality sequencing with uniform base composition across reads.
The “Per base sequence content” plot for ctrl_1_f shows a strong nucleotide imbalance at the beginning of the reads, which is expected in RNA-seq data due to biases introduced during random hexamer priming and fragmentation in library preparation. While this results in a QC warning, it is not indicative of poor data quality. The sequence composition quickly stabilizes after the first ~10 bases, and this pattern is typical for libraries prepared using standard RNA-seq protocols like TruSeq Stranded mRNA.
According to the QC “Adapter Content” chart below, all lines, representing known adapters, stays at very bottom around ~0%, therefore no adapter trimming is required. Moreover, trimming is not as essential before using STAR because it automatically soft-clips low-quality or adapter-contaminated ends during alignment.
include_graphics("graphics/adapter_content.png")
I start off by building the reference genome. I use the following commands to navigate to an appropriate directory:
cd project
mkdir reference
cd reference
I then download the reference genome annotation and assembly using
the following wget commands:
wget -O gencode.v47.basic.annotation.gtf.gz \
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.basic.annotation.gtf.gz
wget -O GRCh38.primary_assembly.genome.fa.gz \
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/GRCh38.primary_assembly.genome.fa.gz
I then unzip using gunzip per below:
gunzip GRCh38.primary_assembly.genome.fa.gz
gunzip gencode.v47.basic.annotation.gtf.gz
I create the following directories to store all the project outputs.
mkdir project_outputs
cd project_outputs
mkdir aligned_reads
cd aligned_reads
I then move forward with aligning all the reads using STAR. This
process is completed via a slurm script named
03_batch_align.sh. Similarly, I run the below to submit
this script:
chmod +x 03_batch_align.sh
sbatch 03_batch_align.sh
For the purposes of this report, please find just the
STAR command within the slurm script below, with an
explanation of each parameter.
STAR --runThreadN 20 \
--genomeDir $GENOME_DIR \
--readFilesIn $FORWARD $REVERSE \
--readFilesCommand zcat \
--outFileNamePrefix $PREFIX \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--outFilterMultimapNmax 1 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignSJoverhangMin 8
--runThreadN 20: Runs STAR with 20 threads,
significantly accelerating alignment by parallelizing the workload
across CPU cores.--genomeDir: Specifies the directory containing the
pre-built STAR genome index. This index is essential for mapping reads
to the reference genome efficiently.--readFilesIn: Indicates that the input data are
paired-end reads. Both the forward (_1.fastq.gz) and
reverse (_2.fastq.gz) files are provided here.--readFilesCommand zcat: Decompresses gzipped FASTQ
files on-the-fly using zcat, allowing STAR to read compressed input
files without requiring manual decompression beforehand.--outFileNamePrefix: Sets the prefix for all STAR
output files (e.g., BAM, logs, etc.), organizing them in the designated
output directory with a consistent naming scheme.--outSAMtype BAM SortedByCoordinate: Instructs STAR to
output alignments in BAM format, sorted by genomic coordinate. This
format is standard for downstream tools like QoRTs and featureCounts
which I will do later.--quantMode GeneCounts: This parameter instructs STAR
to perform basic gene-level quantification during the alignment step. It
generates a file (ReadsPerGene.out.tab) that reports raw read counts for
each gene across four columns: the first column lists the gene ID,
followed by counts assuming unstranded data, counts for strand-specific
data (forward strand), and counts for strand-specific data (reverse
strand). STAR quantifies reads by overlapping aligned reads with exonic
regions defined in the provided GTF annotation file. A read is counted
only if it maps unambiguously to a single gene’s exons; reads
overlapping multiple genes or mapping to non-exonic regions are not
counted. While this approach does not provide isoform-level resolution,
it is a fast and convenient method to obtain gene-level counts and
assess annotation overlap.--outFilterMultimapNmax: This is a highly stringent
filter that only retains reads mapping to a single genomic location. Any
read that maps to more than one locus (even just two) is discarded. This
strict filter reduces ambiguity and improves alignment
specificity—critical for downstream analyses like differential
expression. This can lead to the exclusion of legitimate reads that
align to homologous genes, but ultimately I prioritized unambiguous,
high-confidence alignments for clearer downstream interpretation.--outFilterMismatchNoverReadLmax: Discards reads with
more than 4% mismatches relative to their length. This ensures that only
high-quality, confidently aligned reads are kept, improving mapping
accuracy at the cost of sensitivity.--alignSJoverhangMin: Requires a minimum 8-nucleotide
overhang on each side of a splice junction for a read to be considered a
valid splice junction read. This improves the accuracy of intron/exon
boundary detection, especially important for spliced transcriptomes in
eukaryotic RNA-seq data.I will assess the quality of the alignment using MultiQC
in a later stage.
I then index all my reads using the slurm script
04_index_bams.sh, which I execute by following the
below:
chmod +x 04_index_bams.sh
sbatch 04_index_bams.sh
I now use QoRTs to provide a comprehensive quality
control metrics beyond basic alignment stats, including gene body
coverage, splice junction annotation, and strand specificity checks.
Running it helps ensure that the RNA-seq data is high-quality and
suitable for downstream differential expression and splicing analyses. I
run QoRTs by writing a slurm script 05_run_qorts_all.sh,
which I save within a new directory qorts_out inside the
project_outputs directory. I execute the script using the
below
chmod +x 05_run_qorts_all.sh
sbatch 05_run_qorts_all.sh
For the purposes of this report, please find just the
QoRTs command within the slurm script below, with an
explanation of each parameter.
java -Xmx96G -jar "$QORTS_JAR" QC \
--generatePlots \
"$bam" \
"$GTF" \
"$out_dir"
java -Xmx96G: Allocates up to 96 GB of memory for the
Java Virtual Machine (JVM), which is important for processing large BAM
files efficiently without memory-related failures.-jar "$QORTS_JAR": Specifies the QoRTs executable JAR
file, pointed to by the environment variable $QORTS_JAR. This file was
located in
/athena/angsd/scratch/mef3005/share/envs/qorts/share/qorts-1.3.6-1/QoRTs.jarQC: Runs the Quality Control (QC) mode of QoRTs, which
performs extensive quality checks on aligned RNA-seq data, generating
numerous metrics and intermediate outputs.--generatePlots: Enables automatic generation of plots
(e.g., splice junction profiles, alignment distributions, gene body
coverage) based on the QC results. These visualizations are useful for
quickly identifying issues with alignment, library prep, or
annotation."$bam": The input BAM file containing aligned RNA-seq
reads, sorted by coordinate. This file is required for all QoRTs
analyses."$GTF": The reference gene annotation file in GTF
format, used to interpret alignments in a genomic context (e.g.,
identifying exons, genes, splice junctions)."$out_dir": The output directory where all QC metrics,
tables, and plots will be saved.The script runs successfully. I will assess the quality of the
alignment using MultiQC in a later stage.
Now I want to use featureCounts as it efficiently
quantifies how many reads align to each gene, producing a gene-level
count matrix that’s essential for downstream differential expression
analysis. I plan to include this in the MultiQC report and
use the generated txt to perform downstream analyses. I
write a slurm script 06_run_featurecounts.sh to execute
this. I save this script in a new directory under the
project_outputs directory named
feature_counts.
For the purposes of this report, please find just the
featureCounts command within the slurm script below, with
an explanation of each parameter.
featureCounts -T 24 -p --countReadPairs \
-s 2 \
-a "$GTF_FILE" \
-o "$OUTFILE" \
--minOverlap 10 \
"$ALIGN_DIR"/Ctrl_*_Aligned.sortedByCoord.out.bam \
"$ALIGN_DIR"/SLE_*_Aligned.sortedByCoord.out.bam
-T 24: Specifies the number of threads to use (in this
case, 24), enabling faster processing by leveraging parallel
computing.-p: Tells featureCounts to count fragments (i.e., read
pairs) instead of individual reads. This is appropriate for paired-end
sequencing data.--countReadPairs: Ensures only properly paired reads
are counted as a fragment. This avoids double-counting each end of a
pair.-s 2: Indicates the data is stranded and in reverse
orientation (appropriate for TruSeq Stranded mRNA kits).-a "$GTF_FILE": Specifies the path to the reference
gene annotation file (in GTF format), used to assign reads to genomic
features (e.g., exons, genes).-o "$OUTFILE": Path to the output file where the read
count matrix will be saved.--minOverlap 10: Requires at least 10 base pairs of
overlap between a read and a feature (e.g., exon) for the read to be
counted. This filters out spurious or partial alignments."$ALIGN_DIR"/Ctrl_*_Aligned.sortedByCoord.out.bam and
"$ALIGN_DIR"/SLE_*_Aligned.sortedByCoord.out.bam: These are
the input BAM files for both control and SLE samples, sorted by
coordinate (as required by featureCounts).The script executes successfully, which I will include in the
MultiQC below.
I now use MultiQC to aggregate and summarize output from
STAR, featureCounts, and QoRTs
into a single, interactive HTML report, which I will include in my
project folder. I run MultiQC using the below, which I save
under a multiqc_report directory in the
project_outputs directory.
conda activate multiqc
multiqc /athena/angsd/scratch/par4012/project/project_outputs \
-o /athena/angsd/scratch/par4012/project/project_outputs/multiqc_report
The graphs from MultiQC can be seen below:
include_graphics("graphics/general_statistics.png")
include_graphics("graphics/star_alignment_plot.png")
include_graphics("graphics/star_gene_counts.png")
include_graphics("graphics/qorts_splice_loci.png")
include_graphics("graphics/qorts_splice_events.png")
include_graphics("graphics/qorts_strand_test.png")
include_graphics("graphics/qorts_alignments.png")
include_graphics("graphics/featureCounts_assignment_plot.png")
General Statistics: The general statistics summary in MultiQC offers a high-level overview of the sequencing and mapping quality across all samples. Most samples in the dataset demonstrate excellent alignment percentages, with the “% Assigned” values—the proportion of reads successfully assigned to annotated features—ranging between approximately 70% and 85%. This suggests that the combination of library preparation, read quality, and genome annotation was well-suited for this dataset, enabling accurate transcript quantification. Coverage across chromosomes is also consistent, with nearly all samples spanning over 50 chromosomes, ensuring broad representation across the genome and reducing the risk of systemic mapping biases.
Of note is the sample Ctrl_06, which differs from other samples primarily due to its unusually high total number of reads. While it may initially appear to be an outlier, this deviation is probably attributable to higher sequencing depth rather than a quality failure. Its elevated raw counts in many metrics reflect deeper coverage and thus should not be interpreted as a defect but rather as a feature of the dataset that may require normalization downstream. Similarly, while some samples show slightly lower “% Genes with Counts” values, this likely reflects variability in transcript diversity or sequencing depth across individual patients and does not raise immediate concern about sample integrity.
STAR QC Metrics: The STAR alignment quality plots further reinforce the robustness of the dataset. The majority of reads across all samples fall into the “Uniquely Mapped” category (dark blue), which is the most desirable outcome in RNA-seq alignment. This indicates that STAR, using our specified parameters and a well-indexed reference genome, successfully aligned most reads to unique genomic loci. A modest proportion of reads are categorized as “Mapped to too many loci” or “Unmapped: too short,” which is expected and reflects the presence of low-complexity regions or shorter fragments that inherently pose alignment challenges.
Ctrl_06, while visually distinct in these plots, again reflects a much higher total read count. This sample exhibits more absolute reads in each mapping category—notably more reads that are too short or mapped to multiple loci—but their relative proportions are consistent with the rest of the dataset. These findings suggest that Ctrl_06 has not failed quality control; rather, its patterns simply represent the effect of high coverage. STAR gene count breakdowns mirror this pattern: while Ctrl_06 has higher absolute counts in “Unmapped” and “Ambiguous” reads, it also has a significantly higher total number of mapped reads, and therefore, relative proportions remain within acceptable bounds. No sample stands out for poor mapping efficiency or systematic annotation issues.
QoRTs Quality Control: QoRTs results offer complementary insight into read distribution across genic and intergenic regions, splicing coverage, and strand specificity. In the alignment location plot, most samples show consistent and desirable patterns—high proportions of reads mapping to coding regions (CDS) and untranslated regions (UTRs), with modest levels of reads aligning to intronic or intergenic regions. Ctrl_06 again shows elevated read counts across all categories, but this is consistent with its deeper sequencing depth and not indicative of misalignment. Notably, it retains a high number of reads in annotated genic regions, supporting its inclusion in downstream analyses.
The splice loci and splice events panels display reassuring consistency across all samples. Both control and SLE groups show abundant read support for known splice junctions, with only small contributions from novel loci or low-confidence regions. This consistency across sample groups indicates effective and uniform library preparation and supports the hypothesis that biologically meaningful differences—rather than technical artifacts—will drive differential splicing results.
Finally, the strand specificity plot confirms that the dataset is
overwhelmingly stranded in the fr-firststrand direction, in line with
the TruSeq Stranded protocol used in the original study. Nearly all
reads across all samples map to the expected strand, validating the use
of strand-aware quantification methods (e.g., -s 2 in
featureCounts) and increasing confidence in
annotation-driven analyses. No samples show a shift toward the wrong
strand or exhibit ambiguous strand behavior, further affirming technical
consistency.
featureCounts: The featureCounts plot, summarizing assignment rates across all samples, confirms that the majority of reads in each sample were successfully assigned to annotated gene features. Most samples show large light blue bars representing assigned reads, while unassigned categories—particularly “No Features” and “Ambiguity”—are minimal or moderate. As expected, slight differences in unassignment rates occur across samples due to biological variability and sample-specific expression complexity.
Ctrl_06 again shows markedly higher overall read counts across all categories. While it has more unassigned reads in absolute terms, the relative proportions remain consistent with other samples. There is not enough evidence of a technical error or systematic misannotation. Instead, its greater read depth offers potential analytical advantages, such as increased sensitivity for detecting lowly expressed genes.
Overall, the featureCounts results reinforce the conclusion that the dataset is of high technical quality. All samples fall within expected ranges for assignment and ambiguity, and none display catastrophic loss of annotation fidelity.
Raw read counts were obtained from the txt output of the
featureCounts tool and imported into R for downstream
differential expression analysis. The following steps were taken to
clean and prepare the data:
# Read the raw count matrix
count_data <- read.delim("counts.txt", comment.char = "#")
# Clean up the gene annotation
rownames(count_data) <- count_data$Geneid
count_data <- count_data[, -(1:6)] # remove annotation columns
# Check column names
colnames(count_data) <- gsub(".bam", "", colnames(count_data)) # simplify sample names
To enable accurate grouping of samples by condition, regular expressions were applied to simplify sample names and extract condition metadata:
colnames(count_data) <- gsub(".*aligned_reads\\.(.*)_Aligned.*", "\\1", colnames(count_data))
sample_names <- colnames(count_data)
condition <- ifelse(grepl("Ctrl", sample_names), "Ctrl", "SLE")
col_data <- data.frame(row.names = sample_names, condition = factor(condition, levels = c("Ctrl", "SLE")))
The count matrix and associated sample metadata were used to
construct a DESeqDataSet object, which was then normalized
using the DESeq2 package:
dds <- DESeqDataSetFromMatrix(countData = count_data, colData = col_data, design = ~ condition)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 325 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
During normalization, DESeq2 estimated size factors and
dispersion values, replaced outliers in 325 genes, and applied the
default threshold for the minReplicatesForReplace parameter
(7). The model was then fitted and tested to prepare the dataset for
differential expression analysis.
To assess overall data quality and detect potential batch effects or outliers, I performed exploratory data analysis using three approaches. First, I applied regularized log (rlog) transformation to the raw count matrix to stabilize variance across the range of mean values, which is particularly important for downstream visualization and clustering. Next, I generated a sample-to-sample correlation heatmap to evaluate the similarity in global gene expression profiles across all samples. Finally, I conducted principal component analysis (PCA) to visualize sample clustering and assess separation between the control and SLE groups.
# rlog transformation
rld <- rlog(dds)
# Correlation heatmap
cor_matrix <- cor(assay(rld))
pheatmap(cor_matrix, main = "Sample Correlation Heatmap")
# PCA
plotPCA(rld, intgroup = "condition") + ggtitle("PCA: Ctrl vs SLE")
The sample-to-sample correlation heatmap provides an overview of the global similarity between all RNA-seq samples after regularized log (rlog) transformation. Overall, the heatmap displays high pairwise correlation values (values close to 1) across most samples, indicating a strong and consistent signal throughout the dataset. The clustering pattern generally groups samples by condition (Ctrl vs. SLE), with some intermixing—an expected finding in clinical transcriptomics studies where patient heterogeneity, subtle batch effects, and varying disease activity levels can impact global expression profiles. A few samples, such as SLE_1 and SLE_3, show slightly lower correlations with the rest, which may reflect either biological variability or technical differences during RNA preparation or sequencing. Nonetheless, the consistency in clustering patterns and lack of major outliers suggests that the dataset is of high quality and appropriate for downstream differential expression analysis.
The PCA plot visualizes variance in gene expression across samples, revealing underlying structure in the data. Principal component 1 (PC1) explains 36% of the total variance, and principal component 2 (PC2) accounts for an additional 17%. While there is partial separation between Ctrl and SLE samples along PC1, the modest percentage of variance captured by the first two components suggests that additional biological variation is distributed across lower PCs. This is not uncommon in complex diseases such as SLE, where disease severity, patient heterogeneity, and immune activation states contribute multifactorially to the transcriptomic profile. Future analyses could explore PCs beyond PC2. Nevertheless, the PCA still provides useful insight into overall trends and does create a visual separation between the control and SLE samples.
To identify differentially expressed genes (DEGs) between SLE and control samples, I applied the results() function on the DESeq2 object, using an FDR-adjusted p-value threshold of 0.05. This yielded 5,155 DEGs out of 43,622 genes with nonzero counts, including 3,148 upregulated and 2,007 downregulated genes in SLE. Importantly, I did not apply shrinkage estimation to the log2 fold changes in this step. While shrinkage (e.g., via lfcShrink()) is often useful for effect size visualization and ranking genes for gene set enrichment analysis (GSEA), it is not necessary when the goal is to perform GO term enrichment using a binary significance threshold. Since our focus is on identifying significantly altered genes for categorical GO overrepresentation testing (rather than ranking all genes by effect size), the use of unshrunken fold changes is appropriate and retains the full dynamic range of expression differences. These DEGs will serve as input for downstream functional enrichment analysis to uncover dysregulated biological pathways in SLE.
res <- results(dds, alpha = 0.05)
summary(res)
##
## out of 43622 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 3148, 7.2%
## LFC < 0 (down) : 2007, 4.6%
## outliers [1] : 0, 0%
## low counts [2] : 18638, 43%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
To visualize the distribution of significance values from the
differential expression analysis, I plotted a histogram of the adjusted
p-values (padj) obtained from DESeq2 using the
base R hist function. As shown, there is a clear leftward
skew, with a high concentration of genes displaying low adjusted
p-values. This is consistent with a substantial number of genes being
differentially expressed between SLE and control conditions. The
enrichment of values near 0 indicates strong statistical significance,
while the relatively uniform spread beyond 0.2 suggests a
well-calibrated multiple testing correction. Overall, this distribution
supports the presence of a biologically meaningful signal and validates
the use of GO enrichment downstream.
hist(res$padj,
breaks = 20,
col = "grey",
border = "black",
main = "Adjusted p-values for SLE vs Ctrl",
xlab = "Adjusted p-value",
ylab = "Frequency")
To visualize global expression patterns across differentially expressed genes, I constructed a heatmap of rlog-transformed counts for all genes with an adjusted p-value < 0.05. Row-wise z-score normalization was applied to emphasize relative expression changes across samples. The heatmap reveals distinct clustering between control and SLE samples, with many genes exhibiting consistent upregulation or downregulation in the disease group. Notably, SLE samples cluster tightly together on the right side of the heatmap, while controls show more dispersed patterns on the left, suggesting that gene expression profiles in SLE are more homogeneously altered relative to controls. This clear separation supports the hypothesis that systemic lupus erythematosus is associated with widespread transcriptional reprogramming in the cell type.
# Convert the DESeq2 results object to a data frame and add ENSEMBL IDs as a separate column
res_df <- as.data.frame(res) %>%
rownames_to_column(var = "ENSEMBL")
# Filter for genes with adjusted p-value < 0.05 (significantly differentially expressed)
# and extract their ENSEMBL IDs
sig_genes <- res_df %>%
filter(padj < 0.05) %>%
pull(ENSEMBL)
# Subset the rlog-transformed expression matrix to include only significant genes
rlog_mat <- assay(rld)[sig_genes, ]
# Generate a heatmap of the significant genes using row-based z-score normalization
# Row scaling centers and scales expression values within each gene
# Row names are hidden for legibility
pheatmap(rlog_mat, scale = "row",
show_rownames = FALSE, main = "Heatmap of Significantly Differentially Expressed Genes (SLE vs Ctrl, FDR < 0.05)")
To improve the interpretability of differential gene expression
results, I map Ensembl gene IDs to their corresponding gene symbols
using the org.Hs.eg.db annotation package. First, I clean
the Ensembl identifiers by removing version suffixes (e.g., converting
ENSG0000012345.5 to ENSG0000012345). I then extract the Ensembl IDs from
the DESeq2 results and use the
AnnotationDbi::select() function to retrieve gene symbols
corresponding to each ID. To ensure a one-to-one mapping, I remove any
duplicate entries from the annotation table. Finally, I merge the gene
symbols back into the DESeq2 results dataframe, allowing
downstream tables and visualizations to display gene names rather than
Ensembl IDs.
# Remove version numbers (e.g., ENSG000001234.5 → ENSG000001234)
rownames(res) <- gsub("\\.\\d+$", "", rownames(res))
# Make sure res rownames are Ensembl IDs
ensembl_ids <- rownames(res)
# Create a lookup table
gene_symbols <- AnnotationDbi::select(
org.Hs.eg.db,
keys = ensembl_ids,
columns = c("SYMBOL"),
keytype = "ENSEMBL"
)
# Remove duplicates to keep only one mapping per Ensembl ID
gene_symbols_unique <- gene_symbols[!duplicated(gene_symbols$ENSEMBL), ]
# Merge gene symbols back into your DESeq2 results object
res$symbol <- gene_symbols_unique$SYMBOL[match(rownames(res), gene_symbols_unique$ENSEMBL)]
To visualize the differential expression results, I generate two more commonly used plots: the MA plot and the volcano plot.
The MA plot is created using the plotMA() function from
the DESeq2 package. It displays the log2 fold change (M) on
the y-axis versus the mean of normalized counts (A) on the x-axis,
allowing for the assessment of systematic changes in expression across
the range of expression values. The y-axis is limited to the range of -4
to 4 to focus on the most relevant fold changes.
The volcano plot is generated using the EnhancedVolcano
package. It visualizes statistical significance (adjusted p-values)
versus magnitude of change (log2 fold change). Each point represents a
gene, with the x-axis showing the log2 fold change and the y-axis
showing the negative log10 of the adjusted p-value. I label genes using
their gene symbols (res$symbol) and apply thresholds for
statistical significance (p < 0.05) and biological
relevance (absolute log2 fold change > 1). This provides
an intuitive overview of the most strongly and significantly
differentially expressed genes.
# MA Plot
plotMA(res, ylim = c(-4, 4))
# Volcano plot
EnhancedVolcano(res,
lab = res$symbol,
x = "log2FoldChange",
y = "padj",
pCutoff = 0.05,
FCcutoff = 1,
ylim = c(0, 25),
title = "Differential Expression: SLE vs Ctrl")
In the MA plot, each point represents a gene, plotted by its mean normalized expression (on a log₁₀ scale, x-axis) versus its log₂ fold change in SLE relative to control (y-axis). Most genes cluster tightly around a log₂ fold change of zero, reflecting stability in expression across conditions. However, a prominent number of differentially expressed genes (in blue) lie above and below the center line, especially among genes with high mean expression. This distribution suggests a genuine biological signal rather than noise, as low-expression genes tend to show higher variance and are more prone to false positives. Notably, the plot highlights that a sizable subset of genes is significantly upregulated or downregulated in SLE, pointing to altered transcriptional programs.
The volcano plot sharpens this view by visualizing the relationship between statistical significance (−log₁₀ adjusted p-value, y-axis) and effect size (log₂ fold change, x-axis). Genes with both high statistical significance and large absolute fold change (in red) are the most biologically relevant. In this dataset, several upregulated genes stand out in the upper right quadrant—particularly interferon-stimulated genes (ISGs) such as ISG15, IFI44, IFI44L, USP18, and STAT1. These genes are well-established markers of the type I interferon (IFN-I) pathway, a key immunological signature in SLE. Their upregulation in SLE patients supports our hypothesis and reinforces the pathogenic role of IFN-I signaling in driving immune cell activation, antigen presentation, and chronic inflammation in lupus.
To gain insight into the biological processes associated with the
differentially expressed genes (DEGs), I performed Gene Ontology (GO)
term enrichment analysis using the clusterProfiler package.
First, I extracted the full set of genes tested in the differential
expression analysis (universe_genes) by taking the row
names of the DESeq2 results object (res),
which correspond to Ensembl gene identifiers. I then used the
enrichGO() function to assess whether specific GO
biological processes (BP) are statistically overrepresented among the
significant DEGs (sig_genes, defined by an adjusted p-value < 0.05).
The analysis was conducted with the following parameters:
ont = "BP": I chose the Biological Process (BP)
ontology because it captures high-level functional categories, such as
immune signaling and cell activation, which are particularly relevant in
the context of systemic lupus erythematosus (SLE). BP terms offer
insights into disease-related mechanisms at the process level, making
them ideal for this analysis.OrgDb = org.Hs.eg.db: I used the org.Hs.eg.db database
to supply human gene annotations. Since the dataset is derived from
human CD4+ T cells, this ensures accurate mapping of gene identifiers to
GO terms.keyType = "ENSEMBL": I specified “ENSEMBL” as the key
type because my differential expression results are indexed using
Ensembl gene IDs. This guarantees that identifiers are interpreted
correctly when querying the annotation database.gene = sig_genes: I input the list of significantly
differentially expressed genes (padj < 0.05) as the target gene set
for enrichment. This focuses the analysis on genes showing statistically
meaningful expression differences between SLE and control samples.universe = universe_genes: I defined the universe as
all genes tested in the differential expression analysis. This serves as
a background set for enrichment, reducing bias and ensuring that
observed enrichments are specific to the tested comparison.pAdjustMethod = "BH": I used the Benjamini-Hochberg
(BH) method to correct for multiple testing. This approach controls the
false discovery rate (FDR), which is essential when testing thousands of
GO categories.pvalueCutoff = 0.05: I set a p-value threshold of 0.05
to retain only statistically significant enrichments. This level of
stringency balances discovery with confidence.qvalueCutoff = 0.2: I applied a more lenient q-value
threshold of 0.2 to capture broader patterns that may be biologically
relevant but marginally less significant. This is useful for exploratory
analysis.minGSSize = 5: I excluded GO terms with fewer than 5
associated genes to avoid inflated significance due to small set
sizes.maxGSSize = 800: I filtered out GO terms with more than
800 associated genes to prevent overly generic categories from diluting
the interpretability of the results.This enrichment analysis helps interpret the differential expression results in a functional context by identifying which biological pathways are potentially activated or suppressed in SLE compared to healthy controls.
# Extract universe from rownames of your DESeq2 results dataframe
universe_genes <- rownames(res)
# Remove version numbers from sig_genes
sig_genes <- gsub("\\.\\d+$", "", sig_genes)
# Go term enrichment
go_results <- enrichGO(
gene = sig_genes,
universe = universe_genes,
OrgDb = org.Hs.eg.db,
keyType = "ENSEMBL",
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.2,
minGSSize = 5,
maxGSSize = 800
)
The dotplot below visualizes the top enriched Biological
Process (BP) terms, ranked by adjusted p-value. Each dot represents a GO
term, with the x-axis indicating the ratio of significant genes
annotated to that term (GeneRatio), the color scale representing
statistical significance (adjusted p-value), and the dot size reflecting
the number of genes associated with that term.
Consistent with the interferon-driven immune dysregulation hypothesis in SLE, I observed strong enrichment in pathways related to immune activation and antiviral defense. Terms such as response to virus, regulation of viral process, defense response to virus, and regulation of response to biotic stimulus reflect the upregulation of interferon-stimulated genes (ISGs), a known hallmark of lupus pathophysiology. In addition, I observed increased activity in processes related to the cell cycle, like how cells prepare to divide and organize their chromosomes. This suggests that the CD4+ T cells in SLE patients might be dividing or growing in unusual ways.. These findings align with my hypothesis that SLE CD4+ T cells exhibit enhanced interferon signaling and altered functional programming, and they reinforce the utility of transcriptomic analysis for uncovering systemic immune perturbations in autoimmune disease.
dotplot(go_results, showCategory = 15, font.size = 10, title = "GO: Biological Process")
In this section, I am formatting my Gene Ontology (GO) enrichment
results to be compatible with REVIGO, a tool that summarizes and
visualizes long lists of GO terms by removing redundancy. This file can
now be exported (e.g., with
write.table) and uploaded to
REVIGO to generate a semantic similarity-based summary of enriched GO
terms. This helps reduce redundancy and focus on representative
biological themes.
# Convert enrichGO object to data frame
go_df <- as.data.frame(go_results)
# Filter for adjusted p-value < 0.05
go_filtered <- go_df %>% filter(p.adjust < 0.05)
# Save GO ID and p-value for REVIGO input
revigo_input <- go_filtered %>%
select(ID, p.adjust) %>%
rename(GO_ID = ID, p_value = p.adjust)
# Save as tab-delimited text
write.table(revigo_input, file = "revigo_input.txt", sep = "\t", row.names = FALSE, quote = FALSE)
Revigo provides an R TreeMap output, which I download and load in
using the source method below.
source("revigo/Revigo_BP_TreeMap.R")
This REVIGO treemap provides a high-level summary of the most
significantly enriched Gene Ontology (GO) biological processes
identified in the differentially expressed genes between SLE patients
and healthy controls. Each box represents a semantically distinct GO
term cluster, with larger boxes indicating more significant or
frequently recurring terms, and colors grouping them into similar
functional categories.
Several key biological themes emerge prominently:
Together, this REVIGO summary reinforces the hypothesis that SLE CD4+ T cells exhibit widespread transcriptional dysregulation involving immune signaling, type I interferon response, and altered cell cycle and metabolic states. The visual simplification offered by REVIGO highlights how these biological processes are interrelated, offering clearer insight into the broader systemic changes occurring in SLE.
In this section, I am visualizing the expression of a single gene that is both significantly differentially expressed and associated with a biological process of interest—in this case, “response to virus” (GO:0009615)—to illustrate its behavior across the two experimental conditions (Ctrl vs. SLE).
I start by identifying the list of genes annotated under the GO term
“response to virus” from my enrichment analysis results
(go_results). I then intersect these genes with those that
were significantly differentially expressed in my DESeq2 analysis
(sig_genes) to select a representative gene.
To make the plot more interpretable, I convert the selected Ensembl gene ID into its corresponding gene symbol using the org.Hs.eg.db annotation database. This gene symbol is used in the title of the plot to clearly communicate which gene’s expression is being visualized.
Finally, I use plotCounts() from DESeq2 to generate a
normalized count plot, showing how the expression of this gene differs
between SLE patients and healthy controls. This plot provides a clear,
visual confirmation of differential expression, and helps connect the
statistical results to specific biologically relevant genes.
# Extract genes associated with the GO term of interest
genes_in_go <- go_results@result %>%
filter(ID == "GO:0009615") %>%
pull(geneID) %>%
strsplit("/") %>%
unlist()
# Get first intersecting gene
selected_gene <- intersect(genes_in_go, sig_genes)[1]
# Strip version numbers from rownames of dds (if present)
rownames(dds) <- gsub("\\.\\d+$", "", rownames(dds))
# Retrieve gene symbol for selected Ensembl ID
gene_symbol_df <- AnnotationDbi::select(
org.Hs.eg.db,
keys = selected_gene,
keytype = "ENSEMBL",
columns = c("SYMBOL", "GENENAME")
)
# Extract gene symbol safely (fallback to Ensembl ID if missing)
symbol <- if (!is.na(gene_symbol_df$SYMBOL[1])) gene_symbol_df$SYMBOL[1] else selected_gene
# Plot using gene symbol in title
plotCounts(
dds,
gene = selected_gene,
intgroup = "condition",
normalized = TRUE,
main = paste("Expression of", symbol, "in Ctrl vs SLE")
)
This plot displays the normalized expression levels of ISG15, a well-known interferon-stimulated gene, across control (Ctrl) and systemic lupus erythematosus (SLE) samples. There is a clear and consistent increase in ISG15 expression among SLE patients compared to controls, with very little overlap between the two groups. This upregulation supports the well-established role of type I interferon signaling in SLE pathogenesis and further validates the GO enrichment finding for the “response to virus” biological process. Notably, the variation in ISG15 expression among SLE samples may reflect differences in disease severity or activity across patients, which is consistent with the heterogeneous nature of SLE progression and immune activation.
To generate an IGV visualization, I began by selecting the ISG15 gene within the IGV browser, focusing on its genomic location on chromosome 1. From the available samples, I selected Ctrl_1 and SLE_11 because they had very similar sequencing depth, helping ensure that any differences observed in the tracks would more likely reflect true biological variation rather than technical bias due to coverage.
I then used the samtools view command to subset each BAM file to just the ISG15 locus (chr1:1,011,497-1,016,540), significantly reducing file size and allowing for faster IGV loading. Specifically:
samtools view -b Ctrl_1_Aligned.sortedByCoord.out.bam chr1:1,011,497-1,016,540 > Ctrl_1.ISG15.bam
samtools index Ctrl_1.ISG15.bam
samtools view -b SLE_11_Aligned.sortedByCoord.out.bam chr1:1,011,497-1,016,540 > SLE_11.ISG15.bam
samtools index SLE_11.ISG15.bam
After generating and indexing these BAM files, I loaded them into IGV and colored the reads by first-of-pair strand to highlight strand-specific transcription. I then applied group autoscaling to normalize the y-axis for each track, enabling easier visual comparison of expression levels between the control and SLE sample.
include_graphics("graphics/Ctrl_1_SLE_11_IGV.png")
This IGV snapshot clearly shows increased read coverage over ISG15 in
the SLE_11 sample compared to Ctrl_1, consistent with our earlier
findings from DESeq2 and the plotCounts
function. The denser and more extensive read coverage in SLE_11 confirms
upregulated expression of ISG15 in SLE CD4+ T cells, visually validating
the quantitative differential expression results. Junction reads
(displayed as arcs) are also more prominent in the SLE sample,
indicating increased transcriptional activity and splicing across exons.
These findings reinforce the role of ISG15 in the type I interferon
response and its dysregulation in SLE pathogenesis, supporting both our
hypothesis and previous literature linking interferon signaling to
autoimmune disease activity.
This analysis of CD4⁺ T cell RNA-seq data from patients with systemic lupus erythematosus (SLE) and healthy controls revealed several compelling transcriptional patterns that reinforce the hypothesis and existing biological understanding of the disease. One of the most prominent findings was the strong upregulation of type I interferon-stimulated genes (ISGs), including ISG15, IFI44, IFI44L, and USP18. These genes were among the top differentially expressed candidates, and their elevated expression in SLE samples is consistent with the well-established type I interferon (IFN-I) signature that characterizes immune dysregulation in SLE. Type I interferons play a central role in lupus pathogenesis by promoting autoantibody production, enhancing antigen presentation, and priming autoreactive T and B cells for activation1.
Moreover, gene ontology enrichment analysis using clusterProfiler and visualization with REVIGO revealed that, in addition to interferon signaling, differentially expressed genes were significantly enriched in cell cycle–related processes such as mitotic cell cycle, chromosome organization, and cell cycle phase transition. These findings may reflect heightened activation or proliferative states of circulating CD4⁺ T cells in SLE patients. Prior studies have shown that SLE T-cells can adopt hyperactivated phenotypes, marked by increased metabolic demands, cell cycling, and resistance to apoptosis2. This dysregulated activation may contribute to chronic immune stimulation and tissue damage, further perpetuating the disease process. Together, the enrichment in both IFN-response and cell cycle pathways underscores the complex interplay between external inflammatory signals and intrinsic cellular programming in SLE CD4⁺ T cells.
Several technical challenges emerged during the execution of this project, many of which offered learning opportunities. One of the earliest issues stemmed from the stringency of the STAR alignment parameters, particularly the use of –outFilterMultimapNmax 1, which discards any reads that map to more than one genomic location. While this ensures high-confidence alignments, it is overly conservative for transcriptomic data, where multimapping often reflects biological realities such as gene families, pseudogenes, or paralogs. A more nuanced approach would be to retain multimappers and allow downstream tools (e.g., feature quantifiers or statistical models) to handle them appropriately.
Another significant oversight occurred during the initial run of featureCounts, where I failed to include the –countReadPairs flag. This led to an overestimation of counts by roughly a factor of two, since paired-end reads were counted individually rather than as fragments. I later reran the analysis with the correct flag, ensuring accurate fragment-level quantification. Similarly, I initially neglected to specify the library strandedness using the -s flag in featureCounts, even though the library preparation protocol (TruSeq Stranded mRNA) produces strand-specific data. After identifying this, I corrected the command to use -s 2, ensuring reads were assigned to the correct strand, which is crucial for avoiding ambiguous gene assignments, especially in densely transcribed regions.
Other challenges included handling Ensembl version numbers, which had to be stripped from gene IDs to maintain compatibility across annotation and enrichment tools, and managing inconsistencies in file naming that complicated batch processing scripts. Lastly, while QoRTs provided extensive quality control metrics, interpreting some outputs (such as strand tests and splice events) required revisiting documentation.
This project opens the door to several valuable avenues for future exploration. One natural extension would be to perform differential transcript usage analysis to assess changes at the isoform level, using tools like DEXSeq. These methods could uncover alternative splicing patterns or shifts in isoform expression that are not captured by gene-level analyses. Another promising direction is to stratify SLE patients based on disease severity, treatment status, or other clinical metadata, which could help pinpoint expression signatures associated with flare versus remission or response to therapy. The dataset included 16 samples from patients with less active SLE, so this could certainly be incorporated in a future iteration.
Incorporating single-cell RNA-seq data could also offer more granular insights into T cell heterogeneity. Additionally, integration with epigenomic datasets (e.g., ATAC-seq or ChIP-seq) could help link observed transcriptional changes to upstream regulatory elements and transcription factor binding dynamics. Finally, future work could include machine learning approaches to identify gene expression patterns predictive of disease subtype, treatment response, or prognosis.
In conclusion, this project successfully employed RNA-seq analysis to characterize gene expression differences in CD4⁺ T cells between SLE patients and healthy controls. The results supported the hypothesis and revealed a robust type I interferon signature and activation of cell cycle–related pathways in lupus CD4⁺ T cells, both of which are consistent with previous reports and known disease mechanisms. Despite several technical challenges, careful quality control and methodological revisions helped ensure the reliability of the results. The insights gained here provide a foundation for deeper mechanistic investigations into T cell dysfunction in SLE and support the ongoing development of transcriptomic biomarkers for autoimmune disease monitoring and treatment.
Postal, M., Vivaldo, J. F., Fernandez-Ruiz, R., Paredes, J. L., Appenzeller, S., & Niewold, T. B. (2020). Type I interferon in the pathogenesis of systemic lupus erythematosus. Current opinion in immunology, 67, 87–94. https://doi.org/10.1016/j.coi.2020.10.014↩︎
Hao Li, Afroditi Boulougoura, Yushiro Endo, George C. Tsokos, Abnormalities of T cells in systemic lupus erythematosus: new insights in pathogenesis and therapeutic strategies, Journal of Autoimmunity, Volume 132, 2022, 102870, ISSN 0896-8411, https://doi.org/10.1016/j.jaut.2022.102870.↩︎